Diversify Your Vision Datasets with Automatic Diffusion-based Augmentation
Many fine-grained classification tasks, like rare animal identification, have limited training data, and consequently classifiers trained on these datasets often fail to generalize to variations in the domain, like changes in weather or location. As such, we explore how natural language descriptions of the domains seen in training data can be used with large vision models trained on diverse pretraining datasets to generate useful variations of the training data. We introduce ALIA (Automated Language-guided Image Augmentation), a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains and augment the training data via language-guided image editing. To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information. The resulting dataset is visually consistent with the original training data and offers significantly enhanced diversity. We show that ALIA surpasses traditional data augmentation and text-to-image generated data on fine-grained classification tasks, including cases of domain generalization and contextual bias. Code is available at https://github.com/lisadunlap/ALIA.
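The abstract above describes a three-stage pipeline: generate natural-language domain descriptions, edit each training image toward those descriptions, and filter the edits with a model trained on the original data. A minimal sketch of that loop, where `describe_domains`, `edit_image`, and the `confidence` callback are hypothetical stand-ins for the actual language model, diffusion editor, and filtering classifier:

```python
def describe_domains(dataset):
    # Stand-in for LLM-generated domain descriptions of the training data.
    return ["a photo of a {} in the rain", "a photo of a {} at night"]

def edit_image(image, prompt):
    # Stand-in for diffusion-based language-guided image editing.
    return {"pixels": image["pixels"], "prompt": prompt}

def keep_edit(original, edited, label, confidence, min_conf=0.5):
    # Filtering step: reject edits the task model no longer recognizes as
    # the right class. (The real method also rejects near-identical,
    # minimal edits; that check is omitted here.)
    return confidence(edited, label) >= min_conf

def alia_augment(dataset, confidence):
    # Pair every image with every generated domain prompt, edit, and filter.
    augmented = []
    for image, label in dataset:
        for template in describe_domains(dataset):
            edited = edit_image(image, template.format(label))
            if keep_edit(image, edited, label, confidence):
                augmented.append((edited, label))
    return augmented

# Toy usage with a hypothetical two-image dataset and an always-confident model:
toy_data = [({"pixels": 1}, "sparrow"), ({"pixels": 2}, "warbler")]
augmented = alia_augment(toy_data, confidence=lambda img, lbl: 0.9)
```

This is a sketch of the control flow only; the interesting work lives inside the three stand-in components.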
We carried out an additional continual learning experiment on eight tasks (as in [33, manuscript]) that consist of vision datasets with different domains: {CIFAR-10 / CIFAR-100 / MNIST / SVHN / …}.
[Figure 1: Results on 8 visually different tasks (left), comparison with re-training (middle), and Atari RL (right). The re-training baseline uses AlexNet and re-trains for each task on the entire training sets observed so far.]
We respectfully disagree that our paper has only incremental contributions. Eq. (5) is a general expression that defines proximal gradient descent. Our method can also be applied to the online continual learning setting.
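The response states that its Eq. (5) is a general expression defining proximal gradient descent. For reference, the standard scheme alternates a gradient step on the smooth term with a proximal step on the non-smooth term, x_{k+1} = prox_{eta*g}(x_k - eta*grad_f(x_k)). A minimal sketch on a 1-D lasso problem (this illustrates the generic textbook update, not the paper's specific Eq. (5)):

```python
def soft_threshold(z, t):
    # Proximal operator of t*|x| (soft-thresholding).
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def proximal_gradient(grad_f, prox, x0, eta, steps):
    # Generic proximal gradient descent loop.
    x = x0
    for _ in range(steps):
        x = prox(x - eta * grad_f(x), eta)
    return x

# Example: minimize 0.5*(x - 3)^2 + lam*|x|, whose minimizer is 3 - lam.
lam = 1.0
grad_f = lambda x: x - 3.0                       # gradient of the smooth part
prox = lambda z, eta: soft_threshold(z, eta * lam)
x_star = proximal_gradient(grad_f, prox, x0=0.0, eta=0.5, steps=100)
# converges to 2.0 (= 3 - lam)
```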
3DCoMPaT200: Language Grounded Large-Scale 3D Vision Dataset for Compositional Recognition
Understanding objects in 3D at the part level is essential for humans and robots to navigate and interact with the environment. Current datasets for part-level 3D object understanding cover only a limited range of categories: the ShapeNet-Part and PartNet datasets include only 16 and 24 object categories, respectively, and the 3DCoMPaT dataset, specifically designed for compositional understanding of parts and materials, contains only 42 object categories. To foster richer, fine-grained part-level 3D understanding, we introduce 3DCoMPaT200, a large-scale dataset tailored for compositional understanding of object parts and materials. It spans 200 object categories, an object vocabulary approximately 5 times larger than 3DCoMPaT's, with almost 4 times as many part categories.
Technical note on calibrating vision-language models under covariate shift
Khan, Behraj, Qureshi, Rizwan, Syed, Tahir
Despite being a successful example of emergent capability, vision-language foundation models for low-shot vision classification have a limited ability to generalize to the target data distribution due to sample poverty, leading to sensitivity to variations in the data. A popular mitigation strategy is finetuning over multiple datasets, but domain generalization practiced in this manner is expensive. This work examines both covariate shift between pre-training data and the underspecified target data, and \textit{confidence misalignment}, where the model's prediction confidence is amplified by the limited data availability. We propose \textit{Confidence-Calibrated Covariate Shift Correction ($C3SC$)}, a unified framework that mitigates both covariate shift and confidence misalignment. $C3SC$ leverages a Fisher information penalty for covariate shift correction and a confidence misalignment penalty (CMP) to lower confidence on misclassified examples. Experimental results across various vision and covariate shift datasets demonstrate that $C3SC$ improves calibration (ECE) by up to $5.82\%$. $C3SC$ also shows better robustness, with a $3.5\%$ accuracy improvement on challenging covariate shift datasets, making it a promising solution for reliable real-world vision-language low-shot applications under distribution shift.
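The abstract above reports calibration gains in terms of Expected Calibration Error (ECE). As background, a minimal sketch of the standard binned ECE (this is the common metric definition, not the $C3SC$ method itself): bucket predictions by confidence, then average the per-bin gap between accuracy and mean confidence, weighted by bin size.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: predicted max-class probabilities in (0, 1]
    # correct: 1 if the prediction was right, else 0
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Assign each sample to the bin (lo, hi]; zeros go to the first bin.
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        # Weighted |accuracy - confidence| gap for this bin.
        ece += (len(idx) / n) * abs(acc - avg_conf)
    return ece
```

A perfectly calibrated model has ECE 0; an overconfident model (high confidence, low accuracy) of the kind the abstract targets produces a large gap.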
Review for NeurIPS paper: Fast Adversarial Robustness Certification of Nearest Prototype Classifiers for Arbitrary Seminorms
Additional Feedback: Overall this paper is well presented and technically sound. However, I believe its technical contribution is minor and does not have a significant impact on this field, so I vote for a weak reject. To strengthen the contribution, the authors could consider designing training algorithms that improve the provable robustness of NPCs. For example, RSLVQ is a strong method (in Table 1 it achieves very competitive clean test error); can its robustness be improved to the level of the other baselines?
The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Longpre, Shayne, Biderman, Stella, Albalak, Alon, Schoelkopf, Hailey, McDuff, Daniel, Kapoor, Sayash, Klyman, Kevin, Lo, Kyle, Ilharco, Gabriel, San, Nay, Rauh, Maribeth, Skowron, Aviya, Vidgen, Bertie, Weidinger, Laura, Narayanan, Arvind, Sanh, Victor, Adelani, David, Liang, Percy, Bommasani, Rishi, Henderson, Peter, Luccioni, Sasha, Jernite, Yacine, Soldaini, Luca
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding; precise and limitation-aware artifact documentation; efficient model training; advance awareness of the environmental impact of training; careful model evaluation of capabilities, risks, and claims; as well as responsible model release, licensing, and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing which tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text-centric and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
An Unbiased Look at Datasets for Visuo-Motor Pre-Training
Dasari, Sudeep, Srirama, Mohan Kumar, Jain, Unnat, Gupta, Abhinav
Visual representation learning holds great promise for robotics, but is severely hampered by the scarcity and homogeneity of robotics datasets. Recent works address this problem by pre-training visual representations on large-scale but out-of-domain data (e.g., videos of egocentric interactions) and then transferring them to target robotics tasks. While the field is heavily focused on developing better pre-training algorithms, we find that dataset choice is just as important to this paradigm's success. After all, the representation can only learn the structures or priors present in the pre-training dataset. To this end, we flip the focus away from algorithms and instead conduct a dataset-centric analysis of robotic pre-training. Our findings call into question some common wisdom in the field. We observe that traditional vision datasets (like ImageNet, Kinetics, and 100 Days of Hands) are surprisingly competitive options for visuo-motor representation learning, and that the pre-training dataset's image distribution matters more than its size. Finally, we show that common simulation benchmarks are not a reliable proxy for real-world performance and that simple regularization strategies can dramatically improve real-world policy learning. https://data4robotics.github.io
NeuralLabeling: A versatile toolset for labeling vision datasets using Neural Radiance Fields
Erich, Floris, Chiba, Naoya, Yoshiyasu, Yusuke, Ando, Noriaki, Hanai, Ryo, Domae, Yukiyasu
Models trained using weakly supervised learning might outperform state-of-the-art models when the SOTA models are not trained on task-specific data, but their performance is lower than SOTA models evaluated on evaluation data more similar to their training data. Thus there is a need for tools that can support large dataset creation in a time-efficient and low-cost manner. We hope to contribute to solving this problem by introducing a labeling tool for computer vision datasets that uses the power of Neural Radiance Fields (NeRF) [5] for photorealistic rendering and geometric understanding. Because 3D vision can take advantage of 3D consistency, labeled information about a single scene can be applied to images from multiple viewpoints. This property works particularly well with photorealistic renderings such as NeRF, where richly annotated data with many views is […]
Specialized labeling tools are essential for labeling vision datasets, and both academic researchers and commercial entities have released such tools. Most existing labeling tools (such as Segment Anything Labeling Tool [6] and Roboflow [7]) use single images and therefore require significant human effort for annotating long sequences, use sequential data but have no geometric understanding so they cannot be used for annotating 6DOF poses [8], or require depth data to obtain geometric information [9, 10, 11, 12]. Our toolkit, NeuralLabeling, operates on sequences of images and can thus be used to more rapidly label large datasets, and by using depth reconstruction via NeRF [5] it does not rely on input depth data.
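The passage notes that, thanks to 3D consistency, a label attached once in 3D can be propagated to every registered view of the scene. A toy pinhole-camera sketch of that idea (pure Python, with made-up cameras; NeuralLabeling itself relies on NeRF geometry rather than this simple projection):

```python
def project(point_3d, rotation, translation, focal, center):
    # Map a world point to pixel coordinates for one camera:
    # apply the extrinsics (R, t), then the pinhole intrinsics (f, c).
    x, y, z = (
        sum(rotation[r][c] * point_3d[c] for c in range(3)) + translation[r]
        for r in range(3)
    )
    return (focal * x / z + center[0], focal * y / z + center[1])

def propagate_label(point_3d, label, cameras):
    # One 3D annotation -> a 2D annotation in every registered view.
    return [(cam_id, project(point_3d, R, t, f, c), label)
            for cam_id, (R, t, f, c) in cameras.items()]

# Toy usage: two cameras sharing orientation but offset in depth.
I = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = {"view_a": (I, (0.0, 0.0, 0.0), 100.0, (50.0, 50.0)),
           "view_b": (I, (0.0, 0.0, 2.0), 100.0, (50.0, 50.0))}
labels_2d = propagate_label((1.0, 0.0, 2.0), "mug_handle", cameras)
```

The same point lands at different pixels in each view, so a single 3D click yields consistent 2D labels across the whole sequence.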